Lesson 1: Introduction

Ashir Borah

October 24th, 2024

Welcome!

Why learn to code as a cancer researcher?

  • Reproducible data analysis:

    • Document and share exactly how you analyzed your data

  • Do more with your analysis, more efficiently:

    • More control and flexibility

    • Use community-created analysis tools

    • Leverage cancer data resources (often large datasets)

    • Create awesome visualizations

Imagined learning curve

Our view of the learning curve

Course goals

  • Get everyone past the ‘valley of despair’ in the R learning curve

  • Convince you that R is an accessible and useful tool for you in your research

  • Prepare you to tackle BootCamp projects next week

  • Get you excited to keep developing these coding skills!

Why R specifically

Upsides

  • Free

  • Great for data analysis and visualization

  • LOTS of bioinformatics/stats tools available

Downsides?

  • It’s a hodge-podge

  • Not the best for engineering software

But I have Excel

Quick demo

[LIVE DEMO]

Strategy of the course

  • What this course is

    • Coding basics

    • Heavy emphasis on practical skills (data wrangling, visualization)

    • Flagging areas with more technical depth, but covering only the ‘need-to-know’

  • What this course is not

    • Intro to computer science

    • Intro to stats

Course format

Other helpful R resources

Acknowledgements

We need your help

  • Your feedback is very much appreciated (Slack, email, etc.)

  • We’ll do our best to adapt as we go.

R Basics

Key Tools

  • R Markdown/Notebooks

    • Sort of like a lab notebook for analysis

    • Easily share results and methods in different formats

    • Encourages good code and analysis practices

The R console

Where the action happens!

  • Provide inputs to R

  • See outputs of commands you give it (each on a separate line)

Create project directory

  • Organize your work as ‘Projects’ in RStudio

  • Each project has a separate folder, with data, code, and results.

  • Create a ‘Project’ in RStudio

R as calculator

List of key math operations

  • *: multiplication
  • /: division
  • +: addition
  • -: subtraction
  • ^: ‘raise to the power’
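
As a quick illustration of the operators listed above, here is a minimal sketch you could type into the console. The numbers are arbitrary, and the results are shown as comments.

```r
# Each line is one calculation; the comment shows what R returns
7 * 6          # multiplication: 42
10 / 4         # division: 2.5
2 + 3          # addition: 5
9 - 12         # subtraction: -3
2 ^ 5          # 'raise to the power': 32

# Parentheses control the order of operations, as in ordinary math
2 + 3 * 4      # multiplication happens first: 14
(2 + 3) * 4    # parentheses happen first: 20
```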

Logical operations

  • == Check equals
  • != Check not equals
  • > Greater than, < less than
  • >= Greater than or equal to, etc.
2^2 == 3
[1] FALSE
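
A few more comparisons, as a small sketch with arbitrary numbers. Each comparison returns either TRUE or FALSE.

```r
# Comparisons always answer with TRUE or FALSE
3 == 3       # TRUE  (check equals)
3 != 3       # FALSE (check not equals)
5 > 2        # TRUE
5 < 2        # FALSE
2^2 >= 4     # TRUE  (greater than OR equal to)
2^2 <= 3     # FALSE
```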

Objects in R

  • Variables:

    • Store information/data (the “nouns”)
    • Come in a number of flavors
  • Functions:

    • Set of instructions to perform some task (the “verbs”)
    • We’ll come back to these in a bit

Creating variables

  • You can create new variables with <-
x <- 3 * 4
x
[1] 12
  • All object-creation statements have the form

    • object_name <- value

    • = also works for assignment, but <- is the usual convention in R code

color1 = 'red'
color2 <- 'red'
color1 == color2
[1] TRUE

Variable types in R

  • Numbers
x <- 1
x <- 1.592E-39
  • Strings (text)
x <- 'abc'
x <- "abc"
  • Logical (true/false)
x <- TRUE
x <- FALSE
  • Factors (categorical variables)

    • e.g. (‘bad’, ‘OK’, ‘good’, ‘great’)

    • We will mostly try to avoid these, but be aware of them.
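
One way to check which type a value has is the base R function class(). The sketch below uses the values from this slide; the `rating` factor is a made-up example built from the categories mentioned above.

```r
# class() reports the type of a value
class(1)             # "numeric"
class(1.592E-39)     # "numeric"
class('abc')         # "character" (R's name for strings)
class(TRUE)          # "logical"

# A factor stores categorical data with a fixed set of allowed levels
# (rating is a hypothetical example variable)
rating <- factor(c('good', 'bad', 'great', 'OK'),
                 levels = c('bad', 'OK', 'good', 'great'))
class(rating)        # "factor"
levels(rating)       # "bad" "OK" "good" "great"
```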

Variable naming

  • Variable names must start with a letter and can contain only letters, numbers, ‘_’, and ‘.’ (a leading ‘.’ is also legal, but best avoided).

Not allowed: 4th, my var, weird?, etc.

  • Object naming is important for writing good, readable code

  • Make variable names descriptive

avg_clicks

  • Make function names verbs

calculate_avg_clicks

NOT: var1 or a

Use comments!

  • Code readability is huge so others can understand what you’ve done

    • Including future you!
  • In RMarkdown docs, write descriptive text before each code chunk

  • Also good to add comments to key lines of code within chunks
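
To make the naming and commenting advice concrete, here is a small sketch. The variable names and cell counts are invented for illustration. Anything after a # on a line is a comment that R ignores.

```r
# Number of cells counted in each of two wells (made-up values)
cells_well_a <- 150
cells_well_b <- 210

# Average the two counts to get a per-well estimate
avg_cells_per_well <- (cells_well_a + cells_well_b) / 2
avg_cells_per_well   # 180
```

Compare this with the same calculation written as `a <- (x1 + x2) / 2`: it runs just as well, but a reader can no longer tell what is being averaged.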

R environment

  • See info on current variables

  • Clearing variables

  • View data tables, etc.
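
The Environment pane in RStudio has console equivalents in base R. The sketch below uses ls() to list current variables and rm() to remove one. The variable names are hypothetical.

```r
# Create two example variables (hypothetical names)
tumor_count <- 3 * 4
sample_color <- 'red'

# ls() lists the names of the variables that currently exist
"tumor_count" %in% ls()    # TRUE

# rm() removes a variable; it then disappears from the environment
rm(tumor_count)
"tumor_count" %in% ls()    # FALSE
"sample_color" %in% ls()   # TRUE (untouched)

# To clear everything at once you could run rm(list = ls()),
# which does the same job as the broom icon in the Environment pane
```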

Data Structures

Commonly used R data structures

  • Vectors

  • Lists

  • Matrices

  • Dataframes

Vectors

  • Ordered collection of values. Like a sequence of ‘buckets’

  • Can hold numeric data

  • Or text (strings)
  • Or boolean data (TRUE/FALSE)
  • All data in a vector has to be the same type!

Making vectors

  • Use ‘combine’ function c()
num_vec <- c(1, 2, 3, 4)

log_vec <- c(TRUE, TRUE, FALSE, F)

str_vec <- c('this', 'is', 'a', 'vector', 'of', 'strings')

print(num_vec)
[1] 1 2 3 4
  • Shorthand to create a sequence of integers
1:4
[1] 1 2 3 4
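
The rule that all data in a vector must be the same type can be seen by mixing types inside c(). In this sketch (variable names are illustrative), R silently converts everything to one common type instead of raising an error.

```r
# Mixing a number, a string, and a logical: everything becomes a string
mixed_vec <- c(1, 'a', TRUE)
mixed_vec            # "1" "a" "TRUE"
class(mixed_vec)     # "character"

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0
num_and_log <- c(5, TRUE, FALSE)
num_and_log          # 5 1 0
class(num_and_log)   # "numeric"

# length() reports how many elements a vector holds
length(1:4)          # 4
```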

Missing values

  • Quick notes on missing values in R (will be important)

  • NA (‘not available’) is a special value for missing data that can be included in any type of vector

c(1, 2, NA)
[1]  1  2 NA
c('a', 'b', NA)
[1] "a" "b" NA 
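
Missing values matter because they propagate through calculations. The sketch below uses base R's is.na() and the na.rm argument of mean(). The vector name is an assumption for illustration.

```r
expression_values <- c(1, 2, NA)

# is.na() flags which elements are missing
is.na(expression_values)                 # FALSE FALSE TRUE

# Most calculations that include an NA return NA
mean(expression_values)                  # NA

# na.rm = TRUE drops the missing values before calculating
mean(expression_values, na.rm = TRUE)    # 1.5

# Comparing with == does not detect NA (the result is itself NA),
# so use is.na() instead
NA == NA                                 # NA
```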

Adding more data on to vectors

c() can also be used to add new elements to a vector

string_vec <- c("TP53", "PLEC", "DSPP", "PIK3CA")
string_vec2 <- c(string_vec, "BRAF")
string_vec2
[1] "TP53"   "PLEC"   "DSPP"   "PIK3CA" "BRAF"  

Combining two vectors

string_vec <- c("TP53", "PLEC", "DSPP", "PIK3CA")
string_vec2 <- c("BRAF", "EGFR", "DUSP4")
c(string_vec, string_vec2)
[1] "TP53"   "PLEC"   "DSPP"   "PIK3CA" "BRAF"   "EGFR"   "DUSP4" 

Lists

Lists are like vectors, except that their elements don’t have to be the same type

z <- list('a', 1, TRUE)
z
[[1]]
[1] "a"

[[2]]
[1] 1

[[3]]
[1] TRUE

You can combine lists with the c() function as with vectors

z <- list('a', 1, TRUE)
c(z, 'c')
[[1]]
[1] "a"

[[2]]
[1] 1

[[3]]
[1] TRUE

[[4]]
[1] "c"
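
To see that a list really keeps each element's own type, you can pull out single elements with double brackets, which match the [[1]], [[2]] labels in the printed output. This is a minimal sketch; the vector `v` is added only for contrast.

```r
z <- list('a', 1, TRUE)

# Each list element keeps its original type
class(z[[1]])    # "character"
class(z[[2]])    # "numeric"
class(z[[3]])    # "logical"

# The same values in a vector are all converted to strings
v <- c('a', 1, TRUE)
class(v)         # "character"

# Combining with c() makes the list one element longer
length(c(z, 'c'))   # 4
```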

Matrix

  • Like vectors, but arranged in 2D with rows and columns
  • Example: gene expression values across genes and samples
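
As a sketch of the gene expression example, the base R function matrix() arranges a vector into rows and columns. The expression values and sample names below are made up for illustration.

```r
# 3 genes (rows) x 2 samples (columns); values fill column by column
expr_mat <- matrix(c(5.1, 0.2, 3.3,    # sample1
                     4.8, 0.1, 2.9),   # sample2
                   nrow = 3, ncol = 2,
                   dimnames = list(c("TP53", "BRAF", "EGFR"),
                                   c("sample1", "sample2")))
expr_mat

# dim() gives the number of rows, then the number of columns
dim(expr_mat)                  # 3 2

# Look up a single value by row name and column name
expr_mat["BRAF", "sample2"]    # 0.1
```

Like a vector, a matrix holds a single data type, here numbers.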

Dataframe

  • Most common way of interacting with data

  • Each column is a vector, and different columns can hold different types of data

  • Like an Excel table.
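
A minimal sketch of building a dataframe with the base R function data.frame(), where each argument becomes one column. The table contents and column names are invented for illustration.

```r
# Each column is a vector; columns can have different types
mutation_df <- data.frame(
  gene        = c("TP53", "BRAF", "EGFR"),   # text
  n_mutations = c(12, 5, 8),                 # numbers
  is_oncogene = c(FALSE, TRUE, TRUE)         # logical
)
mutation_df

nrow(mutation_df)               # 3 rows
ncol(mutation_df)               # 3 columns

# $ pulls out one column as a vector
mutation_df$n_mutations         # 12 5 8
class(mutation_df$is_oncogene)  # "logical"
```

In R versions before 4.0, data.frame() converted text columns to factors by default, so the `gene` column may appear as a factor on older installations.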

Key concepts recap

  • Use RMarkdown documents like an experiment notebook

  • Create variables with <-; naming them well is important

  • Variables can be numbers, text (strings) or TRUE/FALSE (boolean)

  • Data organized as vectors, lists, matrices, and dataframes

  • Create/add to vectors (or lists) with c()

  • Make lists with list()